Abstract

Many  recent  connectionist  models  can  be  categorized  as  associative  memories  or  pattern  classifiers. 
Viewed  at  the  right  level  of  abstraction,  the  two  are  the  same.  Connectionists  sometimes  appear  to  be 
trying  to  squeeze  all  of  cognition  into  the  associative  memory  paradigm,  perhaps  because  it’s  the  only  thing 
they  know  how  to  implement  with  gradient  descent  learning  algorithms.  But  the  combinatorial  structure 
of  thought  and  language  indicates  that  the  answer  to  “How  can  slow  components  think  so  fast”  lies  beyond 
mere  associative  recall.  We  must  search  for  additional  cognitive  primitives  that  can  be  implemented  in 
parallel  hardware.  One  modest  successor  to  associative  recall  is  considered  here. 

1  Associative  Memory  and  Pattern  Classification 

If an associative memory is trained on a set of patterns $\{P_i\}$ and then exposed to a novel pattern $P'$, it
will usually map the new pattern to the closest $P_i$. In matrix autoassociators such as Hopfield nets or
Anderson's “brain state in a box” model, any subset of the pattern can serve as the cue for retrieving
the  whole.  These  models  contain  recurrent  connections,  so  their  behavior  is  described  by  differential 
equations  and  they  typically  require  several  iterations  to  settle  into  a  stable  state.  Their  memories  are 
fixed points of a dynamical system.
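
The settling behavior described above can be illustrated with a minimal sketch. It assumes a discrete Hopfield-style network with +1/-1 units, Hebbian outer-product weights, and one-unit-at-a-time updates; the function names and the two stored patterns are hypothetical choices made for illustration, not taken from any of the cited models.

```python
# Minimal Hopfield-style autoassociator (illustrative sketch only).
# Patterns are lists of +1/-1 values; names and data are hypothetical.

def train_hebbian(patterns):
    """Build a symmetric weight matrix by summing outer products (zero diagonal)."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, cue, max_sweeps=10):
    """Update units one at a time until a full sweep changes nothing (a fixed point)."""
    state = list(cue)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            net = sum(w[i][j] * state[j] for j in range(len(state)))
            new = 1 if net >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)
noisy_cue = [-1, 1, 1, 1, -1, -1, -1, -1]  # first stored pattern with one unit flipped
recovered = recall(weights, noisy_cue)     # settles back onto the first stored pattern
```

Here the corrupted cue settles onto the nearest stored pattern, and each stored pattern is itself unchanged by further updates, which is the sense in which memories are fixed points.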

A  second  family  of  associative  memory  models  permits  only  feed-forward  connections;  the  weights  are 
learned  by  backward  error  propagation  or  the  perceptron  learning  rule.  Feedforward  models  can  acquire 
much  wider  sets  of  memories  than  matrix  models,  by  using  intermediate  or  “hidden”  units  to  recode  the 
input as necessary [12]. A hetero-associator, which maps an input pattern $I_i$ to an output pattern $O_i$, can
be viewed as an auto-associator in which $P_i = (I_i, O_i)$ and $P' = (I', 0)$.

A  third  family  of  associative  models,  which  includes  competitive  learning  and  adaptive  resonance 
networks, maintains a prototype for each learned class. A prototype is represented by the connections
made  to  a  dedicated  unit.  Novel  inputs  are  associated  with  the  prototype  whose  unit  they  excite  most 
strongly. Learning new prototypes requires recruiting new units.
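
A rough sketch of the prototype idea follows. It assumes that each prototype is a weight vector, that excitation is a dot product, and that a new unit is recruited when no existing unit responds above a threshold. The threshold, the learning rate, and the function names are illustrative assumptions; this is not an implementation of adaptive resonance or of any specific competitive learning model.

```python
# Prototype-based association (illustrative sketch; all parameters are assumptions).

def best_prototype(prototypes, x):
    """Return the index of the prototype unit most strongly excited by x (dot product)."""
    activations = [sum(wi * xi for wi, xi in zip(w, x)) for w in prototypes]
    return max(range(len(prototypes)), key=lambda k: activations[k])

def learn(prototypes, x, rate=0.5, min_activation=0.5):
    """Move the winning prototype toward x, or recruit a new unit if no unit responds strongly."""
    if prototypes:
        k = best_prototype(prototypes, x)
        act = sum(wi * xi for wi, xi in zip(prototypes[k], x))
        if act >= min_activation:
            prototypes[k] = [wi + rate * (xi - wi) for wi, xi in zip(prototypes[k], x)]
            return k
    prototypes.append(list(x))  # recruit a dedicated unit for a new class
    return len(prototypes) - 1

protos = []
first = learn(protos, [1.0, 0.0])    # no units yet, so one is recruited
second = learn(protos, [0.0, 1.0])   # excites unit 0 too weakly, so a second is recruited
third = learn(protos, [0.8, 0.2])    # close to unit 0, which is nudged toward the input
winner = best_prototype(protos, [0.1, 0.9])
```

The novel input `[0.1, 0.9]` is associated with the second prototype because that unit is the one it excites most strongly.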

Associative architectures have been applied to three broad classes of problems: pattern classification,
analog  function  approximation,  and  pattern  transformation.  When  used  for  pattern  classification,  each  class 
of patterns is named by the single $O_i$ to which all $I'$ in the class are mapped. When used for function
approximation, the goal is to produce an $O'$ close to $f(I')$ after training on a set of pairs $(I_i, f(I_i))$. In
pattern transformation problems the output pattern contains parts of the input rearranged or transformed
in  some  discrete,  systematic  way.  The  early  success  of  associative  memories  in  the  first  two  areas  has 
obscured  some  important  facts  about  their  limitations  in  the  third. 

Perhaps the best-known connectionist pattern classifier is Sejnowski and Rosenberg’s NETtalk program
[14]. NETtalk maps a seven-letter window of English text into a 26-bit output pattern that determines
the pronunciation of the window's middle letter in that context. The output consists of 23 articulatory
features plus three bits for stress and syllable boundary information. By scanning the window across a
body of text and using the back propagation learning algorithm to train the weights, NETtalk can be taught
to “read aloud.” What NETtalk actually learns is a set of 26 binary decision problems, which it solves in
parallel. However, since the 26 output units share the same set of hidden units, and certain correlations
exist among the outputs (e.g., the articulatory features LABIAL and VELAR are mutually exclusive), the 26
classification problems aren’t learned independently.
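
The window-scanning step can be sketched as follows. Only the window size and the output counts come from the description above; the padding symbol, the function name, and the sample text are assumptions made for illustration, and the network that maps each window to its 26 output bits is not shown.

```python
# Sliding-window input generation in the style described for NETtalk (sketch only).

WINDOW = 7            # letters per input window
N_ARTICULATORY = 23   # articulatory feature bits
N_STRESS = 3          # stress and syllable boundary bits
N_OUTPUTS = N_ARTICULATORY + N_STRESS  # 26 binary decisions per window
PAD = "_"             # hypothetical padding symbol for positions past the text edges

def windows(text, size=WINDOW, pad=PAD):
    """Yield (window, middle_letter) pairs, one per character of text."""
    half = size // 2
    padded = pad * half + text + pad * half
    for i in range(len(text)):
        win = padded[i:i + size]
        yield win, win[half]

pairs = list(windows("a cat"))
```

Each pair would be one training case: the window is the input, and the target is the 26-bit pattern for the middle letter, so the same letter can receive different targets in different contexts.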

Lapedes  and  Farber  have  shown  that  back  propagation  can  learn  to  predict  the  behavior  of  very  complex 
(chaotic) functions with better accuracy than previous numerical techniques [7]. Function approximation
by  back  propagation  is  especially  promising  for  control  problems,  such  as  predicting  the  dynamics  of  a 
robot  arm  from  a  collection  of  actual  trajectories. 

One  of  the  interesting  properties  of  back  propagation  is  that  the  hidden  units  evolve  to  detect  regularities 
in  the  input  space.  Hinton’s  work  on  learning  the  structure  of  family  trees  offers  a  particularly  striking 
example  of  this  effect  [5].  But  the  regularities  the  network  discovers  are  not  accessible  to  introspection; 
the  network  cannot  reason  about  what  it  has  learned.  For  example,  the  network  has  no  way  to  compare 
the  regularities  discovered  in  one  group  of  input  units  with  those  discovered  in  other  groups. 

2  Pattern  Transformation  by  Multilayer  Perceptrons 

A  pattern  transformation  machine  is  potentially  computationally  universal,  since  any  function  may  be 
described as a transformation from inputs to outputs. But the class of transformations that are learnable
in  reasonable  time  by  direct  induction  from  training  data  is  limited. 

Allen  reports  experiments  in  using  back  propagation  to  translate  English  sentences  into  Spanish  [1]. 
Starting with a context-free subset of English containing 3300 sentences, the network was trained
on 99% of them and showed good performance on the remaining 1% (33 sentences). This demonstrates
that  certain  kinds  of  complex  syntactic  operations  can  be  learned  by  back  propagation  -  but  only  if  one 
is  willing  to  pay  the  price  of  near-exhaustive  enumeration. 

One  of  the  best-known  pattern  transformation  machines  is  the  Rumelhart  and  McClelland  verb  learning 
model  [13].  This  model  takes  as  input  a  phonetic  representation  of  a  present  tense  English  verb  and 
produces as output the phonetic representation of the past tense. The model has a single layer of trainable
weights.  After  training  on  a  mixture  of  regular  and  irregular  verbs,  the  model  is  able  to  predict  the 
past  tense  of  novel  verbs  by  transforming  their  phonetic  representations  in  accordance  with  the  rules  that 
govern  past  tense  formation  in  English.  It  induced  these  rules  from  the  training  data. 

Another very interesting pattern transformation machine is the case role assignment model of McClelland
and Kawamoto [8], which was able to correctly assign the roles agent, patient, with-PP-modifier, and
instrument to noun phrases in novel sentences while simultaneously performing lexical disambiguation.
As in the verb learning model and the NETtalk model, what this feedforward network was really doing
was  solving  a  collection  of  binary  pattern  classification  problems  in  synchrony. 

3  Why  Associative  Memory  Is  Insufficient 

The triumph of associative models is that they provide evidence in support of the connectionist hypothesis
that problems that appear to require sequential, symbolic inference may be amenable to parallel
solution  by  some  sort  of  continuous  dynamical  system  [15].  But  the  evidence  is  only  suggestive.  No  one 
should  believe  that  human-like  cognitive  mechanisms  are  efficiently  constructible  by  back  propagation. 
Rule-following  behavior  can  be  successfully  approximated  by  back  propagation  networks  only  when  the 
behavior is relatively simple (no combinatorial structure), or the training data covers most of the input
space,  as  in  Allen’s  language  translation  experiment.  In  other  words,  the  brute  force  associative  approach 
to  intelligence  doesn’t  scale. 

In  focusing  on  simple,  trainable  pattern  recognition  machines,  connectionists  appear  to  be  ducking 
the  hard  problems  in  cognition,  as  critics  of  the  approach  have  been  quick  to  point  out.  These  problems 
include: 

•  The compositional structure of language [4]. People’s ability to correctly interpret novel sentences
composed  of  familiar  words,  and  to  put  words  together  in  novel  ways,  cannot  be  reduced  to  learning 
correlations  between  input  and  output  patterns  in  a  back  propagation  network  -  unless  the  training 
set  is  near-exhaustive. 

•  The  need  to  express  complex  relations  between  objects,  as  in  Drew  McDermott’s  “she  is  more 
outgoing with her fellow graduate students than with me, her advisor” [9]. The ability to relate
concepts  not  previously  juxtaposed  is  part  of  what  distinguishes  inference  from  mere  associative 
recall. 

•  The  need  for  variables  and  variable  binding  [11].  Constraints  imposed  by  certain  uses  of  variable 
binding  may  well  be  the  primary  serializing  constraint  on  cognition  [10],  but  variable  binding  is 
nonetheless  essential  for  retrieving  and  applying  knowledge.  Variable  binding  cannot  be  replaced 
by  associative  memory  unless  one  trains  on  virtually  all  the  variable  values  in  advance. 

There have been some attempts at addressing these problems by designing specialized network architectures.
These include Touretzky and Hinton's distributed connectionist production system, DCPS [18];
Touretzky’s BoltzCONS, an architecture for manipulating linked-list structures [16]; Derthick’s
micro-KLONE, a connectionist version of the KL-ONE family of knowledge representation languages [2]; Dyer
and Dolan’s connectionist model of role instantiation in scripts [3]; and Touretzky and Geva’s DUCS,
a connectionist frame system [17]. These models did not evolve by subjecting a network with random
initial  weights  to  voluminous  amounts  of  training  data.  They  were  designed  methodically,  using  such 
techniques  as  coarse  coding,  lateral  inhibition,  and  simulated  annealing  search  to  produce  particular  sorts 
of  symbol  processing  behavior. 

4  Another  Cognitive  Primitive 

In this last section of the paper I will focus on a simple cognitive primitive, “appropriate substitution,”
that highlights the failure of the simple associative approach but that I believe is ripe for connectionist
implementation.  Consider  the  linguistic  phenomenon  known  as  metonymy,  in  which  one  concept  plays 
the  part  of  another.  An  example  taken  from  Lakoff  [6]  is  the  waitress’  observation 

(1)  The  ham  sandwich  just  spilled  beer  all  over  himself. 

Here  the  sandwich  is  standing  in  for  the  customer.  Metonymy  is  quite  common  in  language.  Consider 

(2)  John  cut  an  apple  from  the  tree. 

What John actually cut was neither the apple nor the tree, but rather the stem that connected the apple
to  the  tree.  To  correctly  understand  (2)  one  first  needs  a  way  of  detecting  that  the  unique  selectional 
restrictions  created  by  the  combination  of  “cut”  and  “from  the  tree”  are  not  met  by  the  direct  object 
“apple.”  This  motivates  a  search  for  something  related  to  “apple”  -  in  an  appropriate  way  -  that  could 
substitute  for  it  as  the  patient  in  the  cut  act.  The  problem  is  complicated  by  the  fact  that  the  meanings 
of  the  other  words  in  the  sentence  aren't  fixed.  In  particular,  “cut”  is  a  polysemous  verb:  its  meanings 
include  sever,  section,  slice,  stab,  excise,  dilute,  diminish,  traverse,  and  move  quickly. 

Comprehending  (2)  requires  the  knowledge  that  apples  are  connected  to  trees  by  stems,  and  one  of 
the  senses  of  cut  is  to  sever  a  connection.  Metonymy  clearly  involves  some  sort  of  associative  faculty, 
but  not  the  simple  retrieval  described  earlier.  One  does  not  want  to  map  apple  to  stem  in  all  contexts. 
Apple  certainly  doesn’t  mean  stem  in  either  of  these  sentences: 

(3)  John  ate  an  apple  from  the  tree. 

(4)  John  cut  an  apple  beneath  the  tree. 

The  domain  of  metonymic  reference  is  so  rich  that  one  cannot  hope  to  cover  the  space  with  a  simple 
associative  memory,  except  by  using  the  entire  sentence  as  context  and  training  on  an  exponential  number 
of examples. But a hand-designed metonymy machine - one that can dynamically bring together the bits
of knowledge it needs - should be able to interpret truly novel metonymic references without exhaustive
training.  Even  though  it  might  never  have  considered  cutting  anything  like  a  mooring  rope  before,  a 
metonymy  machine  should  have  no  trouble  inferring  what  is  severed  in 

(5)  John  cut  the  boat  from  the  dock. 
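
The behavior expected of such a machine on sentences (2) through (5) can be caricatured with a purely symbolic toy. The sketch below is not connectionist and is not the design referred to in this paper; the lookup table, the function name, and the single rule for “cut ... from” are all hypothetical, and serve only to show substitution that depends on the whole context rather than on “apple” alone.

```python
# Toy "appropriate substitution" (symbolic illustration only; all knowledge is hypothetical).

# Which connector joins an object to a landmark.
CONNECTED_BY = {
    ("apple", "tree"): "stem",
    ("boat", "dock"): "mooring rope",
}

def interpret(verb, obj, prep, landmark):
    """Return (sense, patient) for 'John <verb> the <obj> <prep> the <landmark>'."""
    if verb == "cut" and prep == "from":
        # "cut ... from" selects the sever sense, whose patient must be a connection,
        # so the direct object fails the selectional restriction and a substitute is sought.
        connector = CONNECTED_BY.get((obj, landmark))
        if connector is not None:
            return ("sever", connector)
    # Otherwise no substitution: the direct object is taken literally.
    return (verb, obj)

reading_2 = interpret("cut", "apple", "from", "tree")
reading_3 = interpret("ate", "apple", "from", "tree")
reading_4 = interpret("cut", "apple", "beneath", "tree")
reading_5 = interpret("cut", "boat", "from", "dock")
```

The point of the toy is that the knowledge about connectors and the knowledge about verb senses are stored separately and combined at interpretation time, so sentence (5) needs no training example of its own.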

In  my  lab  at  Carnegie  Mellon  we  have  begun  working  on  the  design  of  such  a  metonymy  machine. 